Distributed Duplicate Detection in Post-Process Data De-duplication
Authors
Abstract
Data de-duplication is essentially a data compression technique for eliminating coarse-grained redundant data. A typical flavor of de-duplication detects duplicate data blocks within the storage device and de-duplicates them by keeping a single copy and placing pointers to it, rather than storing multiple copies at various places on the disk. Since the advent of de-duplication, the conventional approach has been to scale up de-duplication at the storage controller by consuming more of the controller's resources. This approach has led to several bottlenecks, the most evident being the hogging of controller resources, which limits the number of concurrent de-duplication threads the controller can run and ultimately results in poor de-duplication performance. Given the rate at which data volumes are growing, and with data becoming the core asset that separates one organization from another, high-performing, scalable de-duplication is a challenge organizations are already starting to face. In this work, we propose a scalable design for a distributed de-duplication system that leverages clusters of commodity nodes to scale out suitable tasks of a typical de-duplication system. We explain our distributed duplicate detection workflow, implemented in Hadoop's MapReduce programming abstraction, and discuss the performance statistics obtained with the scale-out de-duplication model.
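The duplicate detection step lends itself naturally to MapReduce. The following is a minimal sketch, not the implementation described in the paper: it assumes an upstream chunking stage has already split the stored data into blocks and computed a fingerprint for each block (for example, a SHA-1 digest), and that each input record is a tab-separated pair of block location and fingerprint. The class names and record layout are illustrative assumptions. The mapper keys each record on its fingerprint so that the shuffle groups identical blocks at one reducer, and the reducer reports every fingerprint observed at more than one location as a group of candidate duplicates.

// Sketch of distributed duplicate detection as a Hadoop MapReduce job.
// Assumes input lines of the form "<blockLocation>\t<fingerprint>" produced
// by an upstream chunking/hashing stage; names and layout are illustrative.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DuplicateDetection {

    // Emits (fingerprint, blockLocation) so the shuffle brings identical
    // fingerprints, i.e. identical block contents, to the same reducer.
    public static class FingerprintMapper extends Mapper<Object, Text, Text, Text> {
        private final Text fingerprint = new Text();
        private final Text location = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length != 2) {
                return; // skip malformed records
            }
            location.set(fields[0]);
            fingerprint.set(fields[1]);
            context.write(fingerprint, location);
        }
    }

    // Collects all locations sharing a fingerprint; any fingerprint seen at
    // two or more locations is reported as a group of candidate duplicates.
    public static class DuplicateReducer extends Reducer<Text, Text, Text, Text> {
        private final Text duplicates = new Text();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder locations = new StringBuilder();
            int count = 0;
            for (Text value : values) {
                if (count > 0) {
                    locations.append(",");
                }
                locations.append(value.toString());
                count++;
            }
            if (count > 1) {
                duplicates.set(locations.toString());
                context.write(key, duplicates);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "duplicate detection");
        job.setJarByClass(DuplicateDetection.class);
        job.setMapperClass(FingerprintMapper.class);
        job.setReducerClass(DuplicateReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The output, a list of fingerprints with their duplicate block locations, would then feed a separate de-duplication pass on the storage controller that replaces the redundant copies with pointers; that pass is outside the scope of this sketch.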
Similar Resources
Cloud Based Data Deduplication with Secure Reliability
IJRAET Abstract: Data de-duplication is used to eliminate duplicate copies of data. It is also used in cloud storage to reduce storage space and upload bandwidth: only one copy of each file is stored in the cloud and can be shared by many users. The de-duplication process thus helps to save storage space, but it also raises the challenge of privacy for sensitive data. The aim of this pap...
Duplicate Web Pages Detection with the Support of 2D Table Approach
Duplicate and near-duplicate web pages hamper the operation of search engines. As a consequence of duplicates and near-duplicates, a common issue for search engines is the growth of indexed storage pages. This high storage demand slows down processing, which in turn increases the serving cost. Finally, duplication also arises while gathering the required data from the var...
Cluster Based Duplicate Detection
We propose a clustering technique for entropy-based text dissimilarity calculation in a de-duplication system. To improve the quality of grouping, in this study we propose a Multi-Level Group Detection (MLGD) algorithm, which produces highly accurate groups of closely related objects using the Alternative Decision Tree (ADT) technique. We propose two new algorithms; the first is a Multi-Level Group...
RefConcile - Automated Online Reconciliation of Bibliographic References
Comprehensive bibliographies often rely on community contributions. In such settings, de-duplication is mandatory for the bibliography to be useful. Ideally, de-duplication works online, i.e., when adding new references, so the bibliography remains duplicate-free at all times. While de-duplication is well researched, generic approaches do not achieve the result quality required for automated re...
A New Method for Duplicate Detection Using Hierarchical Clustering of Records
Accuracy and validity of data are prerequisites for the proper operation of any software system. There is always a possibility of errors occurring in data due to human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity, of which only one should exist in a data source, but for reasons such as aggregation of ...